Issues: classification engine + unified subject resolver + grouped triage UI by nadaverell · Pull Request #811 · skyhook-io/radar

nadaverell · 2026-05-28T12:49:03Z

End-to-end Issues overhaul in one PR: detection engine, identity layer, grouped triage UI, and the foundation hardening that came out of review.

Engine (`internal/issues`, `internal/k8s`)

Classify each Issue by symptom category + group; compose across the four detection sources.
Owner-resolved, deterministic grouping identity (stable issue_id keyed on subject + category).
Fold flat rows into the grouped issue model.

Unified subject resolver (`pkg/subject`) — new

One resolver for identity and app-grouping: Tier-1 Subject (owner-collapsed controller, derived from ownerRefs, zero setup) + Tier-2 AppOverlay (8-tier declared-key overlay — Flux / Argo / Helm / app.kubernetes.io/* — with provenance, confidence, and retained conflicts[]).
internal/issues/identity.go consumes it; StableID is byte-identical to the prior hash (no re-key).
The package doc is honest about scope: it's the shared identity vocabulary consumed by Issues today; pkg/topology consumption is staged for feat: Applications backend — /api/applications + PackageRow app-overlay #823, not yet wired (AppOverlay ships as tested-but-unconsumed scaffold for that work).

GA-blockers

Monotonic crashloop classification — key on RestartCount + LastTerminationState so oscillation no longer churns issue_id.
Double-row suppression — severity-gated, so a critical parent is never downgraded behind a warning child.
Critical-into-unknown gaps closed — PVC Lost, Job / CronJob failures (and CAPI + bad-image/container-create, below).
CRD-condition noise floor — transient-reason aware.
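The monotonic crashloop item above can be illustrated with a minimal sketch. It assumes a trimmed, hypothetical `ContainerSnapshot` type and an arbitrary restart threshold; neither is taken from the PR. The point is only that the verdict reads monotonic history (restart count, last termination) and ignores the instantaneous Waiting/Running state, so the category (and the ID hashed from it) does not flip between polls.

```go
package main

import "fmt"

// ContainerSnapshot is a hypothetical, trimmed view of a container status.
type ContainerSnapshot struct {
	RestartCount         int32
	LastTerminatedReason string // empty if the container never terminated
	Waiting              bool   // true while in a Waiting state such as CrashLoopBackOff
}

// restartThreshold is an assumed value for illustration only.
const restartThreshold = 3

// isCrashLooping depends only on fields that never move backwards during a
// loop, so Waiting -> Running -> Waiting oscillation yields the same answer.
func isCrashLooping(c ContainerSnapshot) bool {
	return c.RestartCount >= restartThreshold && c.LastTerminatedReason != ""
}

func main() {
	inBackoff := ContainerSnapshot{RestartCount: 5, LastTerminatedReason: "Error", Waiting: true}
	betweenCrashes := ContainerSnapshot{RestartCount: 5, LastTerminatedReason: "Error", Waiting: false}
	fmt.Println(isCrashLooping(inBackoff), isCrashLooping(betweenCrashes))
}
```

Both snapshots classify the same way, which is what keeps a category-keyed identifier stable across the between-crash blip.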

UI (`packages/k8s-ui` IssuesView)

Grouped, single-open-accordion triage queue; detector reason surfaced on the collapsed row.
Stable ordering keyed on first_seen (onset) on both server and UI — last_seen churns to compose-time every poll, so sorting on it reshuffles rows under auto-refresh. Stable row identity keeps the expanded card from dropping on refetch.
Onset age column (chronic-vs-acute signal) + "Started X ago · last seen Y ago" in the expanded body.
group / namespace optional on the Issue type to match the backend omitempty wire.
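A rough sketch of onset-keyed ordering follows. The `Issue` struct here is a hypothetical reduction, the oldest-first direction is an assumption, and the real comparator may rank other keys (such as severity) ahead of onset. The (namespace, name, id) tiebreak follows the order described later in this PR's commit notes.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Issue is a hypothetical, trimmed row; only the sort keys are modelled.
type Issue struct {
	ID, Namespace, Name string
	FirstSeen           time.Time
}

// lessIssue orders by onset, then (namespace, name, id) as a total tiebreak.
// last_seen is deliberately not a key: it changes on every poll.
func lessIssue(a, b Issue) bool {
	if !a.FirstSeen.Equal(b.FirstSeen) {
		return a.FirstSeen.Before(b.FirstSeen) // oldest onset first (assumed)
	}
	if a.Namespace != b.Namespace {
		return a.Namespace < b.Namespace
	}
	if a.Name != b.Name {
		return a.Name < b.Name
	}
	return a.ID < b.ID
}

// sortedIDs returns issue IDs in display order without mutating the input.
func sortedIDs(issues []Issue) []string {
	out := append([]Issue(nil), issues...)
	sort.SliceStable(out, func(i, j int) bool { return lessIssue(out[i], out[j]) })
	ids := make([]string, len(out))
	for i, is := range out {
		ids[i] = is.ID
	}
	return ids
}

// sampleIssues builds three rows, two of which share an onset time.
func sampleIssues() []Issue {
	t0 := time.Date(2026, 5, 28, 12, 0, 0, 0, time.UTC)
	return []Issue{
		{ID: "c", Namespace: "prod", Name: "web", FirstSeen: t0.Add(time.Hour)},
		{ID: "b", Namespace: "prod", Name: "api", FirstSeen: t0},
		{ID: "a", Namespace: "dev", Name: "api", FirstSeen: t0},
	}
}

func main() {
	fmt.Println(sortedIDs(sampleIssues())) // [a b c]
}
```

Because every key is stable between polls, a refetch returns rows in the same order and the expanded card keeps its position.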

Hosts

Radar OSS mounts the shared IssuesView at a per-cluster /issues route (thin IssuesPane host over a useIssues → /api/issues hook), mirroring the hub's fleet ProblemsPage. Reachable but not yet linked in the nav — deliberate, so the surface lands before it's promoted.

Detection precision/recall — validated against real clusters

An empirical audit (GKE / EKS / kind) drove a precision/recall pass, each fix re-verified live:

GitOps reconciler health — new DetectGitOpsProblems surfaces ArgoCD Application (health Degraded/Missing, sync OutOfSync when automated, ComparisonError) + Flux Kustomization/HelmRelease (Ready=False, non-transient) — a class the generic CRD-condition fallback structurally can't read. gitops-demo went 0 → 7 GitOps issues. Flux source CRDs (GitRepository et al.) and Argo control-plane CRDs classify as operator_condition_failed, not force-fit into gitops_sync_failed.
Fewer false positives — recovered-after-crash pods no longer flagged; completed-Job pods skipped by missing-ref detection; scaled-to-0-backed Services labeled as such (including mid 1→0 scale-down); cadence-aware CronJob staleness.
Severity calibration — inert dangling refs (singleton-StatefulSet headless Service, deprecated GKE PSP RoleBindings) dropped from blanket critical.
New high_restart category for genuine thrash that isn't a classic CrashLoopBackOff.

Detection/classification gaps closed (second review pass)

CAPI now triageable — Cluster / KubeadmControlPlane → control_plane_not_ready; Machine / MachineDeployment / MachineHealthCheck → machine_not_ready (gated on cluster.x-k8s.io). Detection already existed; only the last classification hop was missing, so a control-plane outage was un-triageable.
Bad-image-tag / container-create failures detected — InvalidImageName / RunContainerError / CreateContainerError / ImageInspectError (shared isFatalWaitingReason; classified in lockstep). These were silently healthy before — InvalidImageName never self-resolves.
StatefulSet compares ready vs desired (spec), not just created (status.replicas), so an ordinal-0-wedged rollout surfaces.
Count is the subject-excluded fan-out size (matches the UI "Affected resources (N)" header + TS), not len(members).
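The StatefulSet item can be sketched as below. `StatefulSetView` is a hypothetical reduction of the spec/status fields involved, and the nil-means-1 default is the standard Kubernetes replica default rather than something stated in this PR. The example models a rollout where the first pod is never created, so status.replicas and readyReplicas are both 0.

```go
package main

import "fmt"

// StatefulSetView is a hypothetical, trimmed view of a StatefulSet.
type StatefulSetView struct {
	SpecReplicas   *int32 // desired; nil defaults to 1
	StatusReplicas int32  // pods created so far
	ReadyReplicas  int32
}

func desiredReplicas(s StatefulSetView) int32 {
	if s.SpecReplicas == nil {
		return 1
	}
	return *s.SpecReplicas
}

// degradedVsCreated is the older comparison: ready against created pods.
func degradedVsCreated(s StatefulSetView) bool {
	return s.ReadyReplicas < s.StatusReplicas
}

// degradedVsDesired compares ready pods against the goal in spec.
func degradedVsDesired(s StatefulSetView) bool {
	return s.ReadyReplicas < desiredReplicas(s)
}

// wedgedAtOrdinalZero models 3 desired replicas with nothing created yet.
func wedgedAtOrdinalZero() StatefulSetView {
	three := int32(3)
	return StatefulSetView{SpecReplicas: &three}
}

func main() {
	s := wedgedAtOrdinalZero()
	fmt.Println(degradedVsCreated(s), degradedVsDesired(s)) // false true
}
```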

Contract hardening (review)

Grouped filtering on the public shape — Compose folds to grouped before applying severity/kind + CEL, so kind=Deployment matches a pod-evidenced Deployment issue and count > N sees the member total.
RBAC-safe CRD namespace fanout — cluster-scoped CRDs use the gated cluster-wide list; explicit namespace sets list per-namespace; namespace-scoped informers are no longer silently dropped.
info excluded from the live queue — severity normalized to critical|warning (info stays honest at the Problem layer but isn't "what's broken now").
Owner grouping consistent — Argo automated.enabled:false treated as manual; cache-aware pod owner resolution (no phantom Deployments, in Issues and /top); pod missing-ref issues fold under their workload.
CEL — added first_seen, grouping_scope, restart_count, last_terminated_reason bindings (first_seen is the queryable onset axis; last_seen is near-useless for "older than…").

Foundation hardening (review-driven)

Detector output type Problem → Detection; the detector layer (problems.go) split into detection.go / capi.go / gitops.go with explicit layer framing.
Identity/status ownership moved off pkg/audit / pkg/packages onto neutral leaf packages: pkg/resourceid (ResourceKey + builtin Kind→Group) and pkg/conditions (transient-reason vocabulary + one shared FindFalseCondition). Future Applications/Packages surfaces depend on the leaves, not on audit/packages.
issues.go god-file split by concern (compose / source_conditions / normalize / dedupe / filters); Group → CategoryGroup; single shared SeverityRank (killed 3 clones); navigateToResourceList deduped in App.tsx.

Cross-repo dependency

Merge alongside radar-hub #57 — mirrors the new CEL bindings into the hub's filter env. Additive (no removed/renamed bindings), but without it the hub's pre-validation would reject fleet-issue filters that reference the new fields.

Precision + footgun detection (later commits)

A benchmark/empirical-driven hardening pass extended the engine with config-level root-cause detectors and precision fixes: faster clearing of recovered crashloops, multi-replica ReadWriteOnce volume conflicts, eviction-blocking PodDisruptionBudgets, rollout-deadlock root causes, and scheduler-verdict ordering by blast radius. Per-resource drill-down (get_resource / diagnose) was made owner-aware and uncapped so an object's issues agree whether you enter from the list or the resource, and classification was tightened (CAPI provider-CRD fallback, namespace-scoped CRD handling, Flux/Argo matching).

Deferred follow-ups (agreed in review):

Cross-subject incident correlation / blast-radius ranking (one ConfigMap → N workloads as one incident; PVC↔pod cross-layer dedup).
Provider → per-source interfaces (R2) — defer until the source set stabilizes.
Affected fixed struct → kind-keyed map (R10) — API debt; bundle with an IssueGroup/IssueEvidence DTO split.
Frontend queue-primitive extraction shared with Checks (B2/B6/B8).

Note

High Risk
Large refactor of the live issues compose path, grouping identity, and CEL contract (including removed cluster binding); behavior changes affect API/MCP consumers, sorting (first_seen), and what rows appear after dedup and noise suppression.

Overview
This PR restructures the issues pipeline from a flat compose path into classify → dedupe → optional grouped fold → filters → CEL, and splits the old monolithic issues.go into focused modules (compose, grouping, category, dedupe, source_conditions, etc.).

Symptom taxonomy: Each row gets a derived category and category_group via a new Classify() mapper over source/kind/reason (GitOps, CAPI, batch, storage, and pod failure shapes). enrichIdentity assigns stable id, grouping_scope, and owner-collapsed subjects via pkg/subject, with fingerprints so distinct missing-ref causes on the same workload do not collapse.

Grouped public model: GroupIssues folds replica fan-out into one row per subject+category (count = non-subject members, inline members capped). Filters.Grouped applies severity/kind/CEL after folding so kind=Deployment and count > N match the triage shape. Dedup prefers scheduling over generic pod problems and drops workload_degraded / rollout_stalled when an equal-or-higher-severity child symptom exists.

CEL / wire: Issue filters gain category, category_group, first_seen, grouping_scope, restart_count, last_terminated_reason; cluster is removed from bindings. Info-severity detections are dropped at compose; issues stay critical|warning only.

Detection layer: k8s.Problem becomes Detection with richer pod owner, restart, and fingerprint fields; DetectGitOpsProblems is composed as its own source. Generic CRD conditions use pkg/conditions, transient/suspend/observedGeneration noise floor, kind-specific curated skips (CAPI core vs provider CRDs), and ListDynamicAllNamespaces for cluster-wide namespaced CRD scans. detect.go adds precision fixes (RWO multi-replica conflicts, PDB eviction blocks, WaitForFirstConsumer PVC suppression, service scale-to-zero labeling, deployment rollout vs degraded dedup at detection).

Shared packages: pkg/resourceid, pkg/conditions, and pkg/subject centralize resource keys, condition reading, and stable IDs so issues/topology/audit do not drift.

Reviewed by Cursor Bugbot for commit 518e8d4. Bugbot is set up for automated code reviews on this repo.

nadaverell · 2026-05-28T22:49:49Z

Follow-up 3512f39: the detector reason/message now rides the collapsed row (was expand-only), filling the dead band between the title and the severity badge so the highest-value triage signal reads without a click. Full text + crash context stay in the expanded body. Also makes Issue.group/namespace optional to match the omitempty wire.

Consumer: the hub fleet Issues page lands in radar-hub-web #85.

Adds a pure, deterministic classifier (Category, with a fixed Category→Group rollup) over the signal radar already emits — Source + Kind + Reason + crash context — and wires it into Compose so every /api/issues and MCP `issues` row carries `category` + `category_group`. Both are server-emitted labels (the UI renders the rollup without its own category→group map) and both are exposed as CEL filter bindings. `unknown` is first-class: categories whose detectors don't exist yet, plus CronJob / Job / CAPI / PVC-Lost / Node-Cordoned, fall through to it rather than being force-fit into a neat bucket.

Every issue now carries three additive identity fields:
- Owner: the topmost stable controller of a Pod problem (Pod→Deployment, not the intermediate ReplicaSet), resolved at detection time via the existing topOwnerForPod and carried on k8s.Problem alongside the RestartCount/LastTerminatedReason crash context.
- GroupingScope: workload|service|pvc|ingress|node|unknown, the subject's coarse bucket (drives the future UI section, part of the ID).
- ID: deterministic cluster-local hash(scope, subject key, category), identical for every member row that rolls up to the same subject+category. The hub namespaces it by cluster_id for global uniqueness.

Subject = the topmost owner when one was resolved (member pods key on their workload), else the resource itself. resourceKey reuses pkg/audit.ResourceKey so issue grouping and audit deep-links share one key format rather than drifting. Purely additive: rows are not yet collapsed; the shared ID is the handle the collapse fold keys on (next slice). No consumer contract changes.
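As an illustration of the ID described above, here is a minimal sketch of a deterministic hash over (scope, subject key, category). The commit message does not say which hash or encoding is used, so SHA-256, the NUL separator, the 8-byte truncation, and the subject-key format are all assumptions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// stableID hashes the three identity components. A NUL separator keeps
// ("a", "bc") and ("ab", "c") from colliding. Hash choice is an assumption.
func stableID(scope, subjectKey, category string) string {
	joined := strings.Join([]string{scope, subjectKey, category}, "\x00")
	sum := sha256.Sum256([]byte(joined))
	return hex.EncodeToString(sum[:8])
}

func main() {
	// Two pods of the same Deployment key on the owner, so they share an ID.
	subject := "apps/Deployment/prod/web" // hypothetical key format
	idFromPodA := stableID("workload", subject, "crashloop")
	idFromPodB := stableID("workload", subject, "crashloop")
	other := stableID("workload", subject, "image_pull")
	fmt.Println(idFromPodA == idFromPodB, idFromPodA == other)
}
```

Since nothing time-dependent or pod-specific enters the hash, every member row that resolves to the same subject and category lands on the same ID.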

GroupIssues collapses the flat evidence rows into the public operational model: one row per shared id (subject+category). A Deployment whose 3 pods all ImagePullBackOff is one issue with affected:{pods:3} + bounded member refs, not three rows.
- /api/issues + MCP issues return grouped rows by default; the cap now counts issue groups, not replica fan-out.
- /api/issues?view=flat returns the raw pre-fold evidence rows for debugging ("what folded into this group?"). MCP stays grouped-only; agents use get_resource/get_events for raw state.
- Compose() stays flat internally, so summarycontext's per-resource index is unchanged; Filters.Grouped gates the fold.
- Representative rules (deterministic): severity = max member, category = shared, subject = topmost owner, reason/message/crash-context from the worst member, age = oldest onset, last_seen = newest, members sorted + capped at 10 with members_truncated past that.

Table-tested: grouping bugs are trust bugs; every consumer inherits them.
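The fold can be sketched roughly as follows. `Row` and `Group` are hypothetical reductions of the real types, severity is modelled as an integer (2 = critical, 1 = warning), and only three of the representative rules are shown: max severity, sorted members, and the cap of 10 with a truncation flag.

```go
package main

import (
	"fmt"
	"sort"
)

// Row is a hypothetical flat evidence row; Group is the folded shape.
type Row struct {
	ID, Pod  string
	Severity int // 2 = critical, 1 = warning (assumed encoding)
}

type Group struct {
	ID               string
	Severity         int
	Count            int
	Members          []string
	MembersTruncated bool
}

const maxInlineMembers = 10

// groupIssues folds rows that share an ID into one group per ID.
func groupIssues(rows []Row) []Group {
	byID := map[string]*Group{}
	var ids []string
	for _, r := range rows {
		g, ok := byID[r.ID]
		if !ok {
			g = &Group{ID: r.ID}
			byID[r.ID] = g
			ids = append(ids, r.ID)
		}
		if r.Severity > g.Severity {
			g.Severity = r.Severity // severity = max member
		}
		g.Count++
		g.Members = append(g.Members, r.Pod)
	}
	sort.Strings(ids) // deterministic output regardless of input order
	out := make([]Group, 0, len(ids))
	for _, id := range ids {
		g := byID[id]
		sort.Strings(g.Members)
		if len(g.Members) > maxInlineMembers {
			g.Members = g.Members[:maxInlineMembers]
			g.MembersTruncated = true
		}
		out = append(out, *g)
	}
	return out
}

// sampleRows builds 12 pod rows that all roll up to one group.
func sampleRows() []Row {
	rows := make([]Row, 0, 12)
	for i := 0; i < 12; i++ {
		sev := 1
		if i == 7 {
			sev = 2
		}
		rows = append(rows, Row{ID: "grp-1", Pod: fmt.Sprintf("web-%02d", i), Severity: sev})
	}
	return rows
}

func main() {
	g := groupIssues(sampleRows())[0]
	fmt.Println(g.Count, len(g.Members), g.MembersTruncated, g.Severity) // 12 10 true 2
}
```

Note that the count stays at the full fan-out while the inline member list is bounded, which is why later commits resolve per-resource lookups from the flat rows rather than from the capped members.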

The presentation sibling to the Checks queue (ChecksView): one row per grouped issue (subject + category), severity rail + pill, single-open accordion expanding to the diagnosis (reason/message + pod crash context) plus the subject and affected-member deep-links. Consumed by BOTH OSS single-cluster and the hub fleet view (the host wires resourceHref / onResourceClick / clusterLabel) so the two surfaces can't diverge. Reuses the established shared atoms (ClusterName, EmptyState) and the EXACT Checks severity hues (critical=red, warning=amber = Checks medium) so the two queues read as one product. Identity (IssueResourceRef + resourceKey) matches the Checks contract + audit.ResourceKey; named IssueResourceRef to avoid colliding with the core single-cluster ResourceRef (same reason Checks uses CheckResourceRef). Faceting stays the host page's job (FleetPageShell), so there are no in-component filter chips. Types mirror radar's grouped Issue (internal/issues.GroupIssues).

…es identity

New pkg/subject is the one resolver the platform plan calls the #1 prerequisite:
- Tier-1 Subject = owner-collapsed root controller (Pod->RS->Deployment, Pod->Job->CronJob), with explicit bare/Node/operator-CR anchors. Deterministic, label-free. Walks via injected OwnerResolver/OperatorRootHook so the package imports neither internal/* nor pkg/topology.
- Tier-2 AppOverlay = 8-tier declared-key precedence (Flux/Argo/Helm tiers 1-5 consolidated from managedby.go in its native order argo-instance<Helm; labels 6-8 net-new) with provenance + confidence + retained conflicts[]; nil when raw wins (bare-app opt-in).
- IssueID/CheckID + ScopeForKind move here. IssueID is byte-identical -> no re-key.
- internal/issues/identity.go:enrichIdentity migrated to consume it (Scope aliased).

Topology determineGroupKey migration (consume Subject for identity + AppOverlay for grouping) is the next step. Verified: go build/vet/test green for pkg/subject + internal/issues.

…aps, CRD noise

- #1 detector monotonicity (internal/k8s/health.go): classify crashloop from RestartCount + LastTerminationState, stable across Waiting->Running->Waiting so the category-hashed issue_id stops churning.
- #2 suppress parent workload_degraded/rollout_stalled when an equal-or-worse child symptom exists on the same subject; severity-gated so a critical rollup is never downgraded to a warning child.
- #3 PVC Lost + Job/CronJob failures classified (were discarded to unknown).
- #5 CRD-condition noise floor: transient-aware via shared packages.IsTransientConditionReason (one source of truth with the GitOps path).

Plus table tests. Verified: go build/vet/test green (internal/issues, internal/k8s, pkg/packages).
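Item #2 above, the severity-gated suppression, might look something like this sketch. The `Issue` struct and the integer severity encoding (2 = critical, 1 = warning) are assumptions; the two parent category names come from the commit message.

```go
package main

import "fmt"

// Issue is a hypothetical, trimmed row for one subject+category.
type Issue struct {
	Subject, Category string
	Severity          int // 2 = critical, 1 = warning (assumed encoding)
}

func isParentRollup(category string) bool {
	return category == "workload_degraded" || category == "rollout_stalled"
}

// suppressParents drops a parent rollup row only when a child symptom on the
// same subject is at least as severe, so a critical parent always survives a
// warning-only child.
func suppressParents(in []Issue) []Issue {
	worstChild := map[string]int{}
	for _, is := range in {
		if !isParentRollup(is.Category) && is.Severity > worstChild[is.Subject] {
			worstChild[is.Subject] = is.Severity
		}
	}
	out := make([]Issue, 0, len(in))
	for _, is := range in {
		if isParentRollup(is.Category) && worstChild[is.Subject] >= is.Severity {
			continue // the child already explains this parent
		}
		out = append(out, is)
	}
	return out
}

func main() {
	kept := suppressParents([]Issue{
		{Subject: "web", Category: "workload_degraded", Severity: 2},
		{Subject: "web", Category: "high_restart", Severity: 1},
		{Subject: "api", Category: "workload_degraded", Severity: 1},
		{Subject: "api", Category: "crashloop", Severity: 2},
	})
	for _, is := range kept {
		fmt.Println(is.Subject, is.Category)
	}
}
```

In the example, the critical parent on "web" is kept alongside its warning child, while the warning parent on "api" is dropped in favour of the critical child.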

…c cycles, rename heuristic resolver

pkg/subject is the canonical identity primitive #823 will build on, so its contract must be exact before that wiring lands. Review-driven:
- OwnerResolver doc now states the contract explicitly: Kubernetes CONTROLLER ownership only (ownerReferences[].controller chain), NOT declarative/app management. Removes the trap that pointed the topology adapter at walkTopmostOwner, which follows every EdgeManages edge, and EdgeManages is overloaded to include GitOps/Helm management (Argo App→resource, GitRepo→Kustomization). Wrapping it would collapse a Deployment's Subject up to its Application, erasing the Tier-1/Tier-2 boundary. #823 must resolve from controllerReferences (k8s.topOwnerForPodResolved is the reference impl). TestResolveSubject_StopsAtController pins it: Pod→RS→Deployment yields Deployment even when an Application manages it.
- Ownership cycles now resolve to a deterministic, start-independent representative (min refKey) instead of the last-hop-before-revisit (which gave a→b but b→a). A canonical identity can't depend on traversal start. TestResolveSubject_CycleIsDeterministic pins it.
- PodOwnerResolver → HeuristicPodOwnerResolver: it fabricates a Deployment from any ReplicaSet by hash-stripping (wrong for Rollouts/custom controllers). It has zero production callers (test-only); the name now carries the caveat so it can't be mistaken for canonical. Dropped the doc's false claim that issues passes it (issues uses only ScopeForKind + StableID).
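The cycle rule above (a start-independent representative chosen as the minimum key) can be sketched like this. The string keys and the `owner` map stand in for the real reference type and the injected resolver; both are hypothetical simplifications.

```go
package main

import "fmt"

// resolveRoot follows controller-owner links from start. With no cycle it
// returns the topmost owner. On a cycle it returns the smallest key among the
// cycle's members, so the answer does not depend on where the walk began.
func resolveRoot(start string, owner map[string]string) string {
	seenAt := map[string]int{}
	var path []string
	cur := start
	for {
		if idx, ok := seenAt[cur]; ok {
			rep := path[idx]
			for _, k := range path[idx+1:] {
				if k < rep {
					rep = k
				}
			}
			return rep
		}
		seenAt[cur] = len(path)
		path = append(path, cur)
		next, ok := owner[cur]
		if !ok {
			return cur // reached an object with no controller owner
		}
		cur = next
	}
}

func main() {
	chain := map[string]string{"pod": "rs", "rs": "deploy"}
	cycle := map[string]string{"a": "b", "b": "a"}
	fmt.Println(resolveRoot("pod", chain))                        // deploy
	fmt.Println(resolveRoot("a", cycle), resolveRoot("b", cycle)) // a a
}
```

A last-hop-before-revisit rule would return "b" when starting from "a" and "a" when starting from "b"; taking the minimum over the cycle members removes that dependence.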

…umented

Round 2 of the canonical-primitive review: close the paths back to 'management edge as identity':
- HeuristicPodOwnerResolver no longer falls back to refs[0] when there's no controller ownerRef. The contract says controller-only; the fallback collapsed a pod under an arbitrary NON-controller owner. Now a pod with only non-controller refs resolves to itself (bare). New test pins both arms.
- Resolve doc corrected: obj feeds ONLY the Tier-2 overlay; ownership ALWAYS comes from the injected resolver. Resolve(ref, obj, nil, …) yields a bare Subject even if obj has ownerRefs; the previous 'obj supplies BOTH' framing was misleading. Spelled out that a pod owner-walk needs an injected resolver.
- Scrubbed EdgeManages / walkTopmostOwner from Tier-1 language in the Subject doc and OwnerLookup (operatorroots.go). Those comments are the #823 integration guide, and 'derivable from the EdgeManages chain' reintroduced the exact ambiguity the contract fix removes. Now: controller ownerReferences / controllerRef-derived edges only.

Cross-cluster audit (gke-management) surfaced a false-positive class: PVCs bound to a WaitForFirstConsumer StorageClass sit in Pending BY DESIGN until a consuming pod is scheduled — dormant/scaled-to-zero/orphaned volumes stay there forever and aren't a fault. The detector flagged every Pending PVC >5min regardless of binding mode, lighting up benign awaiting-consumer volumes. pvcAwaitsFirstConsumer resolves the PVC's StorageClass (explicit or cluster default) and suppresses the row when binding mode is WaitForFirstConsumer — a genuinely-stuck consumer still surfaces as an unschedulable pod via the scheduling source. Immediate-binding Pending (real provisioning failure) and missing-StorageClass (separate detector, critical) are unaffected; verified the kind-bench missing-SC PVC still flags. Also added a message on the remaining Pending rows (was bare 'Pending').
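A simplified sketch of the suppression check described above. The `PVC` and `StorageClass` structs are hypothetical reductions, and an empty storage class name is treated here as "use the cluster default", which glosses over the nil-versus-empty distinction in the real API.

```go
package main

import "fmt"

// StorageClass and PVC are hypothetical, trimmed views of the real objects.
type StorageClass struct {
	Name        string
	BindingMode string // "Immediate" or "WaitForFirstConsumer"
	IsDefault   bool
}

type PVC struct {
	Phase            string
	StorageClassName string // empty means: use the cluster default (simplified)
}

// pvcAwaitsFirstConsumer reports whether a Pending PVC is pending by design.
func pvcAwaitsFirstConsumer(pvc PVC, classes []StorageClass) bool {
	if pvc.Phase != "Pending" {
		return false
	}
	for _, sc := range classes {
		explicit := pvc.StorageClassName != "" && sc.Name == pvc.StorageClassName
		viaDefault := pvc.StorageClassName == "" && sc.IsDefault
		if explicit || viaDefault {
			return sc.BindingMode == "WaitForFirstConsumer"
		}
	}
	// No matching class: not suppressed here, so a missing-StorageClass
	// detector can still flag it.
	return false
}

func sampleClasses() []StorageClass {
	return []StorageClass{
		{Name: "standard-rwo", BindingMode: "WaitForFirstConsumer", IsDefault: true},
		{Name: "fast", BindingMode: "Immediate"},
	}
}

func main() {
	classes := sampleClasses()
	fmt.Println(pvcAwaitsFirstConsumer(PVC{Phase: "Pending"}, classes))                              // true
	fmt.Println(pvcAwaitsFirstConsumer(PVC{Phase: "Pending", StorageClassName: "fast"}, classes))    // false
	fmt.Println(pvcAwaitsFirstConsumer(PVC{Phase: "Pending", StorageClassName: "missing"}, classes)) // false
}
```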

…tectors, chronic onset, total rep order

Review round on the identity/grouping spine:
- ID discriminator gains a stable cause Fingerprint so distinct causes on one subject+category no longer collapse into a single row (a workload missing both a ConfigMap AND a Secret was one missing_config_ref showing only one target). The missing-ref detector sets it from the target-bearing message (stable, deterministic; NOT the flapping reason, which would re-key on refresh); unknown keys on source+reason; every single-cause category stays category-only and byte-identical (no re-key). Test pins split + fold + no-re-key.
- Curated CAPI/GitOps detectors route the all-scope path through ListWatched instead of List(gvr,""). The latter is cluster-wide-only and silently drops namespace-scoped contents in namespace-restricted installs (the generic fallback already did this). Verified no regression on a cluster-wide install.
- FirstSeen no longer resets to now for detectors without a duration: HPA/CronJob now stamp AgeSeconds (resource age), and fromProblem falls back to AgeSeconds when DurationSeconds is 0, so chronic issues stop sorting as fresh.
- betterRepresentative is now a true total comparator (group/kind/ns/name/source/reason/message), so the donated representative is deterministic regardless of input order.

(The 5th finding, the HeuristicPodOwnerResolver refs[0] fallback, was already fixed in 2b16a5e.)

…ive detail

- PodSecurityViolation gets its own category (pod_security_violation, security group) instead of the misleading admission_webhook_blocking; PSA is built-in admission, not a webhook. cert-manager: only Kind=Certificate maps to certificate_not_ready; Issuer/ClusterIssuer/Order/Challenge → operator_condition (a not-ready Issuer isn't a certificate problem). Fixed before these harden into IDs/filters.
- GitOps failures no longer under-ranked vs the detail view: Argo ComparisonError/SyncError/InvalidSpecError and auto-sync HealthMissing, and Flux Ready=False (genuine reason), are now critical (were warning). OutOfSync drift stays warning (self-heals).
- Decisive detail: Argo HealthDegraded/HealthMissing now carry the app's real status.health.message; CAPI Cluster/Machine Failed phase carry status.failureMessage/failureReason (capiFailureDetail). Tests updated.

…inks

- Extract issues.ListResponse + NewListResponse; /api/issues (HTTP) and the MCP issues tool both build from it instead of hand-rolling identical maps, so the contract can't drift (and the hub mirrors one shape). MCP keeps its narrowHint; both keep visibility. Wire output unchanged.
- Local Issues view surfaces truncation: useIssues now carries total/total_matched, and IssuesPane shows 'Showing N of M (capped)' when the queue was windowed, so local Radar no longer presents a capped list as complete.
- Issue/Audit resource clicks encode the opened resource in the URL (?resource=ns/name, the resources view's own deep-link shape) so refresh/share keeps the drawer open instead of dropping it.

…nt contract

The issues tool exposes grouped logical subjects, but the agent's follow-up (get_resource/list_resources/search/diagnose) contradicted it. Three fixes so an AI SRE agent gets the same picture on drill-down:
- BuildIssueIndex now composes GROUPED issues and counts each against its subject AND every affected member, so get_resource Deployment surfaces the Deployment-grouped crashloop (was issueCount=0, keyed only by the evidence Pod). issueCount is now consistent with the issues tool.
- diagnose returns relatedIssues (the grouped issues whose subject or member is the diagnosed object) so the agent sees 'crashloop + missing ConfigMap + HPA can't-scale' up front instead of re-deriving from raw logs. (No issue-id input, per scope.)
- The issues CEL filter schema now documents all runtime bindings (category, category_group, grouping_scope, restart_count, last_terminated_reason, first_seen) and leads with first_seen; agents were missing strong filters like category_group=='startup' or restart_count>10.

…ity honesty, ref optionality, golden tests

From the MCP-focused review (do-now set; pushed back on the overzealous items):
- CAPI MachineDeployment/MachineHealthCheck (cluster.x-k8s.io) and KubeadmControlPlane (controlplane.cluster.x-k8s.io) now use group-qualified discovery instead of kind-only, so they can't attach to a same-kind CRD from another group.
- HPA gets a cause fingerprint ('hpa:<problem>') so one HPA that's BOTH maxed and unable-to-scale surfaces as two distinct issues (distinct fixes) rather than collapsing. (Targeted, not blanket fingerprinting every detector.)
- IssuesPane surfaces incomplete RBAC visibility as a caveat banner, so an empty queue under degraded visibility reads 'limited visibility' not 'nothing broken' (useIssues now carries visibility).
- IssueResourceRef group/namespace are optional to match the Go omitempty wire (a Node/core-group member arrives without them); callers default to ''.
- Copy: 'grouped by the resource they affect' (scopes span service/PVC/node/…, not just workloads).
- Golden contract tests: a Pod-evidenced issue surfaces on its owning Deployment in the issue index (get_resource) and in RelatedIssues (diagnose) + on the Pod.

Held for a Hub-coordinated pass: cluster CEL binding (#5/#9). Pushed back: cold-cache owner churn (transient/rare), affected-truncation (not reachable in OSS), source-vocab (opaque pass-through; only stale test fixtures).

…#1 + #13)

Conceded from review: get_resource/diagnose built issueSummary via a SEPARATE flat-by-exact-resource path (computeMCPIssueSummary / computeIssueSummaryForResource) that BuildIssueIndex never touched, so get_resource Deployment/web returned an empty summary while the issues tool showed it broken. And both BuildIssueIndex and RelatedIssues iterated the grouped issue's inline Members (capped at maxInlineMembers=10), dropping pods #11+ of a large fan-out.

Both now resolve a resource's grouped issues from FLAT evidence (uncapped) + the resolved owner, keyed/deduped per grouped-issue ID:
- RelatedIssues matches grouped subjects AND every flat evidence row.
- BuildIssueIndex counts distinct grouped issues per resource (each evidence resource + its owner).
- computeMCPIssueSummary + computeIssueSummaryForResource route through RelatedIssues, so get_resource on a workload (or any affected pod, capped or not) surfaces the same grouped issues the issues tool shows.

Tests pin owner-rollup AND the uncapped (>10 members) case.

- #2 (Service): a Service that is BOTH no-ready-endpoints AND has an unresolved named targetPort now stays two issues (distinct fixes: the workload vs the Service port spec) via stable fingerprints, instead of collapsing under one service_no_endpoints row.
- #5 (cluster): removed the 'cluster' CEL binding + the always-empty Issue.Cluster field + its activation projection. A single Radar is one cluster, so Issue.Cluster was always empty and a forwarded 'cluster == x' matched nothing; the advertised filter returned the wrong answer. Cross-cluster scoping is the hub's clusters=/target mechanism (applied at fan-out), not a per-issue predicate. MCP/filter docs updated to say so.

… sort

- #7: issues(namespace="prod") for a namespace the caller can't access now returns 403 forbidden (MCP + REST), not an empty list. 'unauthorized' must not read as 'nothing broken', which is a real trust gap for an SRE agent. The no-explicit-namespace path still returns empty (the caller asked for nothing specific).
- #9: lessIssue tiebreak is now (namespace, name, id), byte-identical to the shared UI comparator's single-cluster order (the UI's only extra key, cluster, is constant for one cluster). The parity claim in the comment is now true instead of aspirational.

…O conflicts, order scheduler clauses by blast radius

- isStableCrashLoop trusts a probe-gated Ready as a recovery signal, so a container that crashed at startup and is now serving clears immediately instead of reading crashloop-critical for the full 5m Running window. Gated on a readiness probe being defined (without one, Ready just mirrors Running and flips during a loop's between-crash blip).
- New volume_access_mode_conflict detector: a Deployment wanting >1 replica that mounts a ReadWriteOnce PVC is flagged with the fix (RWX / StatefulSet volumeClaimTemplates / 1 replica). This is the config-level root cause, named from spec, distinct from the observed multi-attach symptom.
- summarizeReasons orders scheduler clauses by nodes-rejected descending, so the widest-blast-radius constraint leads instead of the scheduler's arbitrary predicate order.
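The clause ordering in the last item could be sketched as below. The `Clause` struct is hypothetical and assumes the scheduler message has already been parsed into per-constraint counts; how summarizeReasons does that parsing is not shown in this PR text.

```go
package main

import (
	"fmt"
	"sort"
)

// Clause is a hypothetical parsed piece of a scheduler verdict, for example
// "4 Insufficient cpu" becomes {NodesRejected: 4, Reason: "Insufficient cpu"}.
type Clause struct {
	NodesRejected int
	Reason        string
}

// orderClauses returns a copy sorted by nodes rejected, descending. The sort
// is stable so equal counts keep the scheduler's original relative order.
func orderClauses(clauses []Clause) []Clause {
	out := append([]Clause(nil), clauses...)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].NodesRejected > out[j].NodesRejected
	})
	return out
}

func sampleClauses() []Clause {
	return []Clause{
		{NodesRejected: 1, Reason: "node(s) had untolerated taint"},
		{NodesRejected: 4, Reason: "Insufficient cpu"},
		{NodesRejected: 1, Reason: "node(s) didn't match Pod's node affinity/selector"},
	}
}

func main() {
	for _, c := range orderClauses(sampleClauses()) {
		fmt.Printf("%d %s\n", c.NodesRejected, c.Reason)
	}
}
```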

…ollout deadlocks

- New pdb_blocks_evictions detector: a PodDisruptionBudget that allows 0 voluntary disruptions while all selected pods are healthy (maxUnavailable=0 or minAvailable>=replicas) silently blocks node drains and cluster upgrades. Keyed on status.DisruptionsAllowed==0 with structural guards (observed generation current, healthy>=desired) so transient zero-budget during a real outage isn't flagged.
- Enrich the existing 'Rollout stuck' row: when a stuck Deployment mounts a ReadWriteOnce PVC and isn't strategy: Recreate, append the root cause + fix. The surge pod can't attach the volume the old pod holds. This is the classic rollout deadlock, named on the row that's already firing (no new noise).
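A minimal sketch of the PDB check, under stated assumptions: `PDBView` is a hypothetical reduction whose field names mirror the policy/v1 PodDisruptionBudget status, and the `ExpectedPods > 0` guard is an addition of this sketch (to skip budgets that select nothing), not something the commit message states.

```go
package main

import "fmt"

// PDBView is a hypothetical, trimmed view of a PodDisruptionBudget.
type PDBView struct {
	Generation         int64
	ObservedGeneration int64
	DisruptionsAllowed int32
	CurrentHealthy     int32
	DesiredHealthy     int32
	ExpectedPods       int32
}

// pdbBlocksEvictions flags a budget that permits zero voluntary disruptions
// even though every guarded pod is healthy. During a real outage
// (CurrentHealthy < DesiredHealthy) a zero budget is expected and not flagged.
func pdbBlocksEvictions(p PDBView) bool {
	statusIsCurrent := p.ObservedGeneration >= p.Generation
	selectsPods := p.ExpectedPods > 0 // assumed guard, see lead-in
	allHealthy := p.CurrentHealthy >= p.DesiredHealthy
	return statusIsCurrent && selectsPods && allHealthy && p.DisruptionsAllowed == 0
}

func main() {
	// minAvailable equal to replicas: healthy, yet nothing may be evicted.
	structural := PDBView{Generation: 2, ObservedGeneration: 2, DisruptionsAllowed: 0,
		CurrentHealthy: 3, DesiredHealthy: 3, ExpectedPods: 3}
	// One pod is down, so the zero budget is a transient effect of the outage.
	outage := PDBView{Generation: 2, ObservedGeneration: 2, DisruptionsAllowed: 0,
		CurrentHealthy: 2, DesiredHealthy: 3, ExpectedPods: 3}
	fmt.Println(pdbBlocksEvictions(structural), pdbBlocksEvictions(outage)) // true false
}
```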

…s, fix merged doc, complete category labels

- normalizeImagePullMessage: drop the redundant 'lookup .* no such' branch (already covered by 'no such host'), which removes the unbounded .* CodeQL flagged.
- Remove unused resourceKey/resourceRefKey from the issues module (Checks owns its own copy; no issues consumer).
- Restore pvcAwaitsFirstConsumer's doc comment (had merged into resourceAge's).
- Add curated labels for pod_security_violation, control_plane_not_ready, machine_not_ready.

…a goal, Flux match precision, Argo reason format; CAPI/namespace CRD fallback

Detector:
- Deployment desired count uses schedDesiredReplicas (spec is the goal; nil→1; a scale-down's terminating pods no longer inflate the denominator).
- A ProgressDeadlineExceeded rollout supersedes the workload_degraded row for the same Deployment: one incident, not two redundant rows.

Classification:
- Flux group match tightened 'fluxcd' → 'fluxcd.io', consistent with the sibling argoproj.io / cert-manager.io matches and collision-safe.
- Argo Rollout Reason formatted via condTypeReason ('Progressing: ProgressDeadlineExceeded') to match every other condition row; InvalidSpec guarded against doubling when reason restates the type.

Generic CRD fallback:
- isCuratedCRDKind is kind-specific for CAPI (core Cluster/Machine/KCP/MHC), so provider CRDs (AWSMachine, bootstrap configs) still get the generic condition fallback instead of being silently skipped.
- Skip cluster-scoped CRDs when an explicit namespace filter is set.

Tests cover CAPI provider-CRD fallback and namespace-scoped skip.

cursor · 2026-05-31T13:54:41Z

+		if r != "" && r != "InvalidSpec" {
+			reason = condTypeReason("InvalidSpec", r)
+		}
+		return reason, m, true


Variable shadows named return in argoRolloutFailure

Low Severity

Inside argoRolloutFailure, the local declaration reason := "InvalidSpec" shadows the function's named return parameter reason. The code works correctly because explicit return values are used, but the shadowing is a readability trap — a future maintainer adding a bare return or reading the logic could easily confuse the two scopes.
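The trap can be reproduced in a small standalone sketch. The function names and bodies below are hypothetical and are not the PR's argoRolloutFailure; they only show how a `:=` inside an inner block creates a second `reason` that a later bare return never sees.

```go
package main

import "fmt"

// failureShadowed declares a NEW reason inside the if block. The bare return
// at the end hands back the outer named result, which is still empty.
func failureShadowed(r string) (reason string, ok bool) {
	if r != "" {
		reason := "InvalidSpec: " + r // shadows the named result
		_ = reason
		ok = true
	}
	return
}

// failureFixed assigns to the named result instead of redeclaring it.
func failureFixed(r string) (reason string, ok bool) {
	if r != "" {
		reason = "InvalidSpec: " + r
		ok = true
	}
	return
}

func main() {
	a, _ := failureShadowed("bad selector")
	b, _ := failureFixed("bad selector")
	fmt.Printf("%q %q\n", a, b) // "" "InvalidSpec: bad selector"
}
```

Returning explicit values, as the reviewed code does, sidesteps the problem; renaming the local or dropping the named results would remove the ambiguity entirely.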

Reviewed by Cursor Bugbot for commit dd77b1d.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 0b45f00.

cursor · 2026-05-31T14:09:41Z

+			// Source CRDs (GitRepository/OCIRepository/Bucket/HelmChart/
+			// HelmRepository) failing to fetch is a source/reconcile failure,
+			// not a sync — don't inherit the applier's category.
+			return CategoryOperatorConditionFail


Misleading comment on unreachable Flux classification branch

Low Severity

The case g == "kustomize.toolkit.fluxcd.io" || g == "helm.toolkit.fluxcd.io": branch in the SourceCondition handler of Classify is entirely dead code. isCuratedCRDKind marks both Kustomization and HelmRelease as curated, causing detectGenericCRDIssues to skip them before they ever reach classification via the condition path. Since Flux assigns exactly one kind per group, no other kind from these groups exists to reach this branch. The comment at lines 243–245 compounds the confusion by referencing "Source CRDs (GitRepository/…)" which live in an entirely different group (source.toolkit.fluxcd.io).

Additional Locations (1)

internal/issues/source_conditions.go#L200-L204

Reviewed by Cursor Bugbot for commit 0b45f00.

+			for _, ns := range f.Namespaces {
+				its, err := p.ListDynamic(gvr, ns)
+				if err != nil {
+					log.Printf("[issues] Failed to list %s (%s) in %s: %s", logsafe.Sanitize(gvr.Resource), logsafe.Sanitize(gvr.Group), logsafe.Sanitize(ns), logsafe.Sanitize(err.Error()))


nadaverell requested a review from hisco as a code owner May 28, 2026 12:49

cursor Bot reviewed May 28, 2026

Comment thread packages/k8s-ui/src/components/issues/IssuesView.tsx Outdated

Comment thread packages/k8s-ui/src/components/issues/IssuesView.tsx Outdated

Comment thread packages/k8s-ui/src/components/issues/IssuesView.tsx Outdated

nadaverell force-pushed the feat/issues-ui branch 2 times, most recently from 5333aad to 0f6b860 Compare May 28, 2026 17:25

cursor Bot reviewed May 28, 2026

Comment thread packages/k8s-ui/src/components/issues/types.ts Outdated

nadaverell force-pushed the feat/issues-ui branch from 90e9db9 to afc4a79 Compare May 29, 2026 22:19

nadaverell changed the title ~~k8s-ui: shared IssuesView — grouped live-issue triage queue~~ Issues: classification engine + unified subject resolver + grouped triage UI May 29, 2026

nadaverell mentioned this pull request May 29, 2026

Issues: symptom classification + owner-grouping (engine) #803

Closed

cursor Bot reviewed May 29, 2026

Comment thread packages/k8s-ui/src/components/issues/severity.ts

cursor Bot reviewed May 29, 2026

Comment thread pkg/packages/gitops.go Outdated

nadaverell mentioned this pull request May 29, 2026

feat: scheduling-blocker + admission-failure detection (ResourceQuota / LimitRange / PodSecurity / webhook) #826

Closed

cursor Bot reviewed May 30, 2026

Comment thread pkg/packages/gitops.go

nadaverell force-pushed the feat/issues-ui branch from cc8b88f to e779bde Compare May 30, 2026 00:35

cursor Bot reviewed May 30, 2026

Comment thread internal/issues/grouping.go

nadaverell force-pushed the feat/issues-ui branch from e779bde to e12212b Compare May 30, 2026 12:58

cursor Bot reviewed May 30, 2026

Comment thread packages/k8s-ui/src/components/issues/types.ts

cursor Bot reviewed May 30, 2026

Comment thread packages/k8s-ui/src/components/issues/IssuesView.tsx

nadaverell mentioned this pull request May 30, 2026

Issues foundation hardening: dependency graph, vocabulary, structure #828

Merged

cursor Bot reviewed May 30, 2026

Comment thread internal/issues/grouping.go

Comment thread internal/issues/category.go

nadaverell mentioned this pull request May 30, 2026

fix: 404 the whole /.well-known/* tree for MCP discovery probes #829

Merged

cursor Bot reviewed May 30, 2026

Comment thread internal/k8s/gitops.go Outdated

Comment thread internal/k8s/gitops.go Outdated

nadaverell added 8 commits May 30, 2026 22:02

k8s-ui(issues): group/namespace optional to match omitempty wire

cad50c6

k8s-ui(issues): onset age column (chronic-vs-acute triage signal)

766b022

cursor Bot reviewed May 31, 2026

Comment thread internal/k8s/detect.go Outdated

Comment thread internal/issues/source_conditions.go

github-advanced-security AI found potential problems May 31, 2026

Comment thread packages/k8s-ui/src/components/issues/types.ts Fixed

nadaverell added 4 commits May 31, 2026 05:34

github-advanced-security AI found potential problems May 31, 2026

Comment thread pkg/subject/subject.go Fixed

nadaverell added 3 commits May 31, 2026 09:10

cursor Bot reviewed May 31, 2026

Comment thread internal/k8s/detect.go

Comment thread internal/k8s/detect.go Outdated

nadaverell added 3 commits May 31, 2026 10:56

cursor Bot reviewed May 31, 2026

Comment thread internal/issues/grouping.go

cursor Bot reviewed May 31, 2026

Comment thread internal/issues/source_conditions.go

Comment thread internal/k8s/detect.go Outdated

nadaverell added 4 commits May 31, 2026 12:31

cursor Bot reviewed May 31, 2026

github-advanced-security AI found potential problems May 31, 2026

Comment thread internal/issues/source_conditions.go Fixed

nadaverell force-pushed the feat/issues-ui branch from dd77b1d to 0b45f00 Compare May 31, 2026 13:59

cursor Bot reviewed May 31, 2026

fix(issues): address final review findings

518e8d4

nadaverell force-pushed the feat/issues-ui branch from 0b45f00 to 518e8d4 Compare May 31, 2026 14:14

github-advanced-security AI found potential problems May 31, 2026

nadaverell merged commit 189d684 into main May 31, 2026
9 checks passed

nadaverell mentioned this pull request May 31, 2026

feat: Applications backend — /api/applications + PackageRow app-overlay #823

Open
